In this project, we aim to predict both future salary and earned run average (ERA) of a MLB pitcher using statistics from their past year. The rationale behind this exploration is that, we believe, baseball general managers can be enamored with statistics that are ostensibly important but do not actually predict future performance. For example, a pitcher's total wins is a flashy statistic but the outcome of the game is depedent on several factors outside of the pitcher's control such as their team's offense as well as the opposing pitcher's performance. Thus, the ultimate goal of this project is to investigate if baseball general managers are valuing and paying for the "correct" pitching statistics or the statistics that are actually correlated with or predictive of future performance.

Salary Data Exploration

For data exploration, we will explore the correlations and associations between pitching statistics from the 2015 season and 2016 salary of 220 MLB pitchers in order to determine if performance in the past year can predict future salary.

total_2015_filtered <- read_csv("total_2015_filtered.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   player_name = col_character()
## )
## ℹ Use `spec()` for the full column specifications.

Basic Pitching Statistics

Wins and losses constitute the most basic of pitching statistics. Total number of pitches is pretty self explanatory too. Additionally, raw, unadjusted ERA, or the average number of runs a pitcher allows in a game is another predictor that one would expect to be associated with salary. Lastly, the number of earned runs, in a similar vein as ERA, is another predictor that stands out as potentially correlated.

ggplot(total_2015_filtered,
       aes(
         x = `Total Wins`,
         y = log(salary))
       ) +
  geom_point(alpha = 0.5)+
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(total_2015_filtered$`Total Wins`,
    log(total_2015_filtered$salary))
## [1] 0.4519149
ggplot(total_2015_filtered,
       aes(
         x = `Total Loss`,
         y = log(salary))
       ) +
  geom_point(alpha = 0.5)+
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(total_2015_filtered$`Total Loss`,
    log(total_2015_filtered$salary))
## [1] 0.4172997
ggplot(total_2015_filtered, 
       aes(
         x = Pitches,
         y = log(salary))
       ) +
  geom_point(alpha = 0.5)+
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(total_2015_filtered$Pitches,
    log(total_2015_filtered$salary))
## [1] 0.4850451

Unsurprisingly, win stand out with a strong, positive correlation with future salary. Somewhat surprisingly, losses also have a positive, moderately strong correlation. This apparent inconsistency could be due to the fact that better pitchers pitch more games and thus accumulate more losses.

Total number of pitches has a moderate, positive correlation with salary. As we will see later on, pitching statistics having to do with productivity or volume of pitches are often quite highly correlated with salary.

total_2015_filtered <- 
  total_2015_filtered %>% 
  mutate(Win_Percent = `Total Wins`/(`Total Wins`+`Total Loss`)) %>% 
  drop_na()

ggplot(total_2015_filtered, 
       aes(
         x = Win_Percent,
         y = log(salary))
       ) +
  geom_point(alpha = 0.5)+
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(total_2015_filtered$Win_Percent,
    log(total_2015_filtered$salary))
## [1] -0.1146149

Surprisingly, a pitcher's win percent has almost no correlation at all with salary and if any relationship exists it appears that it is a negative association. Rather, it seems like total number of wins and losses is positively correlated with salary, regardless of one's actual win percentage.

ggplot(total_2015_filtered,
       aes(
         x = `ERA`,
         y = log(salary))
       ) +
  geom_point(alpha = 0.5)+
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(total_2015_filtered$`ERA`,
    log(total_2015_filtered$salary))
## [1] -0.004887819
ggplot(total_2015_filtered,
       aes(
         x = `Total ER`,
         y = log(salary))
       ) +
  geom_point(alpha = 0.5)+
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(total_2015_filtered$`Total ER`,
    log(total_2015_filtered$salary))
## [1] 0.4164102

The lack of a negative association between ERA and salary is surprising as giving up less runs is an unequivocally positive outcome. Additionally, in a trend that we will see throughout this project, earned runs is, surprisingly, positively correlated with salaries. There are a couple of reasons why this could be. The first is that only good pitchers are allowed to accumulate a lot of earned runs, less accomplished pitchers would simply lose their starting position. Additionally, it seems like total volume of pitches, regardless of pitch outcome, is generally positively correlated with salary.

Season Totals and Percentages

While the total number of strikes, balls, and hits might have a dubious relationship to future performance. It is undeniable that season totals can be quite flashy and can lead to lucrative contracts thus might be strong predictors of next year's salary.

ggplot(total_2015_filtered,
       aes(
         x = `Total Strikes`,
         y = log(salary))
       ) +
  geom_point(alpha=0.5)+
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(total_2015_filtered$`Total Strikes`,
    log(total_2015_filtered$salary))
## [1] 0.4859769
ggplot(total_2015_filtered,
       aes(
         x = `Total Balls`,
         y = log(salary))
       ) +
  geom_point(alpha = 0.5)+
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(total_2015_filtered$`Total Balls`,log(total_2015_filtered$salary))
## [1] 0.452774
ggplot(total_2015_filtered,
       aes(
         x = `Total Hit`,
         y = log(salary))
       ) +
  geom_point(alpha = 0.5)+
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(total_2015_filtered$`Total Hit`,
    log(total_2015_filtered$salary))
## [1] 0.476637

Unsurprisingly, seaon totals in general have a strong, positive correlation with salary, even undesirable outcomes like hits and balls are quite highly correlated. In terms of correlation, total strikes has the greatest R value but all 3 are quite close.

Next we looked to see if the percent of strikes, hits, or balls has a greater correlation than just raw totals.

ggplot(total_2015_filtered,
       aes(
         x = `Strike Percent`,
         y = log(salary))
       ) +
  geom_point(alpha = 0.5)+
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(total_2015_filtered$`Strike Percent`,
    log(total_2015_filtered$salary))
## [1] 0.01479224
ggplot(total_2015_filtered,
       aes(
         x = `Ball Percent`,
         y = log(salary))
       ) +
  geom_point(alpha = 0.5)+
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(total_2015_filtered$`Ball Percent`,
    log(total_2015_filtered$salary))
## [1] -0.1673246
ggplot(total_2015_filtered,
       aes(
         x = `Hit Percent`,
         y = log(salary))
       ) +
  geom_point(alpha = 0.5)+
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(total_2015_filtered$`Hit Percent`,
    log(total_2015_filtered$salary))
## [1] 0.1421157

Interestingly, the percentages are far less correlated with salary than the totals. The strongest correlation was with percent of pitches that are balls, with a moderate-weak, negative correlation, which is reasonable as balls are something pitchers try to reduce. Surprisingly, the percent of pitches that result in hits is positively correlated with salary, albeit with a very weak correlation. This relationship cannot be explained as easily.

Pitch Averages

Next, we examined if the averages of certain statistics were correlated with salary, specifically looking at average speed, average effective speed, as well as average spin rate, horizontal break, and vertical break.

ggplot(total_2015_filtered,
       aes(
         x = `Average Pitch Speed`,
         y = log(salary))
       ) +
  geom_point(alpha = 0.5)+
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(total_2015_filtered$`Average Pitch Speed`,
    log(total_2015_filtered$salary))
## [1] -0.2012988
ggplot(total_2015_filtered,
       aes(
         x = `Average Effective Pitch Speed`,
         y = log(salary))
       ) +
  geom_point(alpha = 0.5)+
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(total_2015_filtered$`Average Effective Pitch Speed`,
    log(total_2015_filtered$salary))
## [1] -0.1627285

Naively, one might expect faster pitches to be harder to hit and thus expect pitchers with higher average velocities to be handsomely paid. However, the negative correlation, with a not-insignificant correlation, suggests that average pitch speed is actually inversely related to salary. There are a couple of reasons why this might be. The most obvious is that pitchers who lack speed often make up for it with stellar control or a wide variety of possible pitches. Another reason is that, just instinctively, average pitch speed might be correlated with certain negative predictors like more home runs as batters can often make hard contact against fast but poorly-placed pitches.

ggplot(total_2015_filtered,
       aes(
         x = `Average Spin Rate`,
         y = log(salary))
       ) +
  geom_point(alpha = 0.5)+
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(total_2015_filtered$`Average Spin Rate`,
    log(total_2015_filtered$salary))
## [1] 0.04215349

Even though spin rate is something that pitchers place a lot of emphasis on, it seems like there is no strong correlation between salary and average spin rate. This is confirmed both quantitatively through the R value but also visually through the plot, despite the recent interest in spin rates, it seems like its not a great predictor of salary.

ggplot(total_2015_filtered,
       aes(
        x = `Average Horizontal Break`,
        y = log(salary))
       ) +
  geom_point(alpha = 0.5)+
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(total_2015_filtered$`Average Horizontal Break`,
    log(total_2015_filtered$salary))
## [1] 0.05052511
ggplot(total_2015_filtered,
       aes(
         x = `Average Vertical Break`,
         y = log(salary))
       ) +
  geom_point(alpha = 0.5)+
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(total_2015_filtered$`Average Vertical Break`,
    log(total_2015_filtered$salary))
## [1] -0.04485314

Another topic that has garnered more interest in the baseball community recently due to advances in terms of pitch tracking is horizontal and vertical break or the degree of movement of the ball in flight. Looking at the spread of the data and the low R value, it seems like, despite the emphasis certain people put on these two statistics, neither is a good predictor of salary.

Advanced Statistics

Lastly, we get to our advanced statistics, namely batting average against (BAA), slugging percentage against (SLGA), weighted on base average against (wOBAA), batting average on balls in play against (BABIPA), as well as some of our own creations such as "Corner %", which serves as a proxy for pitcher control, and "Barrel %", which gives us an understanding of the percentage of pitches that result in hard contact.

First we looked at at BAA, SLGA, BABIPA, and wOBAA.

ggplot(total_2015_filtered,
       aes(
         x = `BAA`,
         y = log(salary))
       ) +
  geom_point(alpha = 0.5)+
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(total_2015_filtered$`BAA`,
    log(total_2015_filtered$salary))
## [1] 0.06512834
ggplot(total_2015_filtered,
       aes(
         x = `SLGA`,
         y = log(salary))
       ) +
  geom_point(alpha = 0.5)+
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(total_2015_filtered$`SLGA`,
    log(total_2015_filtered$salary))
## [1] 0.05012007
ggplot(total_2015_filtered,
       aes(
         x = `BABIPA`,
         y = log(salary))
       ) +
  geom_point(alpha = 0.5)+
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(total_2015_filtered$`BABIPA`,
    log(total_2015_filtered$salary))
## [1] 0.03949315
ggplot(total_2015_filtered,
       aes(
         x = `wOBAA`,
         y = log(salary))
       ) +
  geom_point(alpha = 0.5)+
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(total_2015_filtered$`wOBAA`,
    log(total_2015_filtered$salary))
## [1] 0.01525886

Shockingly, not a single advanced statistic is significantly correlated with salary. Going into this project, we had expected that advanced statistics were going to be worse predictors of salary than base statistics like wins but we had not expected a complete lack of correlation. This result suggests that either these statistics do not do as good of a job as one would have expected in quantifying pitcher performance or that general managers mistakenly overlook advanced statistics.

ggplot(total_2015_filtered,
       aes(
         x = `Corner %`,
         y = log(salary))
       ) +
  geom_point(alpha = 0.5)+
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(total_2015_filtered$`Corner %`,
    log(total_2015_filtered$salary))
## [1] -0.004861808
ggplot(total_2015_filtered,
       aes(
         x = `Average Barrel`,
         y = log(salary))
       ) +
  geom_point(alpha = 0.5)+
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(total_2015_filtered$`Average Barrel`,
    log(total_2015_filtered$salary))
## [1] 0.04628606

We had created corner % in order to serve as a proxy for a pitcher's control by quantifying what percent of their pitches end up in one of the corner zones rather than right down the middle. Barrel % was designed to represent the percentages of pitches that resulted in hard contact. However, neither of them had a significant correlation with future.

Correlations Between Predictors

total_quant_pred <- 
  select_if(
    total_2015_filtered,
    is.numeric
    )

total_quant_pred <- 
  total_quant_pred %>% 
  select(
    -yearID,
    -mlb_id,
    -W,-L,
    -`Average Balls`
         )

predictor_correlations <- 
  cor(total_quant_pred)

In the interest of saving some space, we will not be printing the entire correlation matrix. However, some salient trends stand out in terms of collinearity.

First of all, the "volume" predictors related to sheer number of pitches and games played are all very highly correlated to each other. Specifically, total pitches, total balls, total hits, total strikes, total wins, total losses, total ER all have high R values with each other. These are also some of the stronger predictors in predicting salary. In the interest of avoiding collinearity, we might want to combine these variables.

Interestingly, spin rate is positively correlated with strike percent and negatively correlated with hit percent. So even though its not a good predictor of salary, baseball pundits might be on to something in promoting this statistic as increasing strikes and reducing hits is a positive outcome.

Average barrel percentage has a medium positive correlation with 2 of our advanced statistics, SLGA and wOBAA, as well as ERA. This makes sense as harder contact means more runs allowed.

Strike, ball, and hit %'s were all negatively correlated with each other which is reasonable as they are mutually exclusive. Strike % also had high, negative correlations with the BAA SLGA and wOBAA statistics while hit % had a high positive correlation with those three predictors. This is understandable as those statistics try to quantify total offensive output, which consists of accumulating hits while avoiding strikes.

Our advanced statistics, BAA, SLGA, BABIPA, and wOBAA are all quite positively correlated with ERA. Given that the statistics all try to quantify offense and that ERA represents total offensive runs allowed, this relationship is to be expected. What is really notable is that ERA, as well as these statistics, were not a good predictor of future salary. However, considering that a pitcher's job is to reduce ER and keep their ERA low, the lack of a relationship between next year's salary and ERA/these advanced statistics is quite interesting.

An additional, interesting, observation is that ERA is positively correlated with number of losses with a medium degree of correlation (R=0.4), while not being correlated to number of wins (R=0.03). This suggests that a poor pitcher with a high ERA can easily lose games while a skilled pitcher with a low ERA cannot win games for his team alone.

Predicting Future ERA

<<<<<<< HEAD

Having explored possible predictors of salary (comparing both simple and complex variables), we move on to our prediction of future Earned Run Average (ERA). In this case, our selected response (ERA) can be thought of as a rough proxy for general pitching performance and outcome. Since the inception of the statistic in the early 1900s, ERA has been the most ubiqitous and cited measure of pitcher effectiveness, as it represents a straightforward calculation of the average amount of runs a pitcher allows over the duration of typical game (with run prevention thought of as the ultimate aim of pitching). Despite ERA being the most functional measure of pitching performance, it is nonetheless subject to a significant deal of random noise between individual pitchers' seasons. The development of advanced pitching analytics therefore may lend itself to more accurate forecasts of future ERA than simple counting statistics (such as wins, strikeouts, and innings pitched). With this in mind, we are motivated to find the statistical model that most accurately predicts ERA in a succesive season given a number of predictors generated from data in the current season. This should allow us to assess and predict player performance not based on actual (occasionally random) outcomes, but rather on an aggregation of predicted outcomes determined by a set of relevant statistical parameters. Ultimately, this model should lend itself to making informed salary decisions by MLB general managers.

pitcher_data <- read_csv("Mega Summary.csv")
pitch_data <- read_csv("2020_summary.csv") %>%
  rename(`Pitch Category` = pitch_category)

Ostensibly, the most obvious predictor of forecasted ERA is current ERA, as we can intuitively expect pitchers who performed well in a given season to also perform well in subsequent seasons.

pitcher_data %>%
  ggplot(aes(x = ERA, y = `ERA (t+1)`)) +
=======

Having explored possible predictors of salary (comparing both simple and complex variables), we move on to our prediction of future Earned Run Average (ERA). In this case, our selected response (ERA) can be thought of as a rough proxy for general pitching performance and outcome. Since the inception of the statistic in the early 1900s, ERA has been the most ubiqitous and cited measure of pitcher effectiveness, as it represents a straightforward calculation of the average amount of runs a pitcher allows over the duration of typical game (with run prevention thought of as the ultimate aim of pitching). Despite ERA being the most functional measure of pitching performance, it is nonetheless subject to a significant deal of random noise between individual pitchers' seasons. The development of advanced pitching analytics therefore may lend itself to more accurate forecasts of future ERA than simple counting statistics (such as wins, strikeouts, and innings pitched). With this in mind, we are motivated to find the statistical model that most accurately predicts ERA in a succesive season given a number of predictors generated from data in the current season. This should allow us to assess and predict player performance not based on actual (occasionally random) outcomes, but rather on an aggregation of predicted outcomes determined by a set of relevant statistical parameters. Ultimately, this model should lend itself towards making informed salary decisions by MLB general managers.

pitcher_data <- read_csv("Mega Summary.csv")
pitch_data <- read_csv("2020_summary.csv") %>%
  rename(`Pitch Category` = pitch_category) %>%
  rename(wOBA = wOBAA)
pitch_data  %>%
  ggplot(aes(x = `Average Pitch Speed`, y = wOBA)) +
>>>>>>> 10ce5a14b5b3ac8398fed932a0f2c507c35fe9dc
  geom_point(alpha = 0.5) +
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'
<<<<<<< HEAD

cor(pitcher_data$ERA, pitcher_data$`ERA (t+1)`)
## [1] 0.2037155

Despite the expected existence of a postive linear association between the two variables, the correlation between ERA and ERA in the subsequent season is somewhat weak. This can likely be explained by the aforementioned influence of randomness in pitching outcomes. This leads us to believe that other variables (or a combination therof) may have more predictive power in forecasting ERA than ERA by itself. First, we look at some of the same counting stats that were significant in predicting salary:

pitcher_data %>%
  ggplot(aes(x = W, y = `ERA (t+1)`)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(pitcher_data$W, pitcher_data$`ERA (t+1)`)
## [1] 0.03254124
=======

cor(pitch_data$`Average Pitch Speed`, pitch_data$BAA)
## [1] 0.08112214
pitch_data  %>%
  ggplot(aes(x = `Average Pitch Speed`, y = wOBA, color =`Pitch Category`)) +
  geom_point(size = 2, alpha = 0.5) +
  geom_smooth(method = lm) +
  theme_minimal()
## `geom_smooth()` using formula 'y ~ x'

One thing we find in our exploration is that average pitch speed has no meaningful relationship with other measures of pitch performance (such as wOBA), even when the data is almost perfectly seperated by categorical variable level. This is also true of average spin rate, and the results are consistent with different response choices (BA, SLG, etc.). This leads us to conclude that pitch tracking data has no predictive power when it comes to modelling pitcher performance. This helps explain why Statcast uses batted ball variables (such as launch angle and exit velocity) rather than pitch tracking variables as the predictors in the calculation of their expected outcome statistics (such as xwOBA and xBA). Accordingly, we will most likely avoid using average pitch speed and average spin rate in our model and focus on other predictors of future ERA instead.

Ostensibly, the most obvious predictor of forecasted ERA is current ERA, as we can intuitively expect pitchers who performed well in a given season to also perform well in subsequent seasons.

>>>>>>> 10ce5a14b5b3ac8398fed932a0f2c507c35fe9dc
pitcher_data %>%
  ggplot(aes(x = SO, y = `ERA (t+1)`)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'
<<<<<<< HEAD

cor(pitcher_data$SO, pitcher_data$`ERA (t+1)`)
## [1] -0.08567439

Interestingly, the counting statistics that were correlated with salary have little correlation with forecasted ERA. In fact, the positive correlation between wins and forecasted ERA suggests that the average forcasted ERA actually increases (albeit very marginally) as wins improve. This intimates that there are variables that will likely be more effective in predicting future ERA, which we explore below:

=======

cor(pitcher_data$ERA, pitcher_data$`ERA (t+1)`)
## [1] 0.2037155

Despite the expected existence of a postive linear association between the two variables, the correlation between ERA and ERA in the subsequent season is somewhat weak. This can likely be explained by the aforementioned influence of randomness in pitching outcomes. This leads us to believe that other variables (or a combination therof) may have more predictive power in forecasting ERA than ERA by itself. First, we look at some of the same counting stats that were significant in predicting salary, namely total wins, total pitches, and total strikes, which were the three predictors with the highest correlation (quantified through R value) with salary:

pitcher_data %>%
  ggplot(aes(x = W, y = `ERA (t+1)`)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(pitcher_data$W, pitcher_data$`ERA (t+1)`)
## [1] 0.03254124

Interestingly, the counting statistics that were correlated with salary have little correlation with forecasted ERA. In fact, the positive correlation between wins and forecasted ERA suggests that the average forcasted ERA actually increases (albeit very marginally) as wins improve. This intimates that there are variables that will likely be more effective in predicting future ERA, which we explore below:

pitcher_data %>%
  ggplot(aes(x = pitches, y = `ERA (t+1)`)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(pitcher_data$pitches, pitcher_data$`ERA (t+1)`)
## [1] 0.07691844

Similar to wins, we see close to zero relationship between number of pitches and ERA in the following year. If anything, the relationship appears to be positive, albeit incredibly weak, suggesting that more pitches in the previous year is correlated with both a higher salary and a worse performance.

pitcher_data %>%
  ggplot(aes(x = SO, y = `ERA (t+1)`)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(pitcher_data$SO, pitcher_data$`ERA (t+1)`)
## [1] -0.08567439

Lastly, we see strikeouts in the previous year shows a very weak, close to zero correlation with ERA. Here, the relationship is negative, which is what one would expect, however, the weakness of the association suggests that strikeouts are not as good of a predcitor of future performance as one would expect following the salary analysis.

From this analysis, we see that there is no relationship between the top three predictors of salary and the pitcher's ERA the next season. This already provides some evidence that GMs are not properly valuing the right attributes as the two most correlated with salary have no relationship to future performance (quantified through ERA). This finding also is intuitively not all that surprising, these basic counting statistics might seem impressive in a vaccum but provide no indication as to future performance. We hypothesize that advanced statistics such as xWOBA or xBA are going to be better predictors of future performance than the salary-correlated predictors.

>>>>>>> 10ce5a14b5b3ac8398fed932a0f2c507c35fe9dc
pitcher_data %>%
  ggplot(aes(x = `K %`, y = `ERA (t+1)`)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(pitcher_data$`K %`, pitcher_data$`ERA (t+1)`)
## [1] -0.3489183
pitcher_data %>%
  ggplot(aes(x = woba, y = `ERA (t+1)`)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(pitcher_data$woba, pitcher_data$`ERA (t+1)`)
## [1] 0.2586885
pitcher_data %>%
  ggplot(aes(x = xwoba, y = `ERA (t+1)`)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(pitcher_data$xwoba, pitcher_data$`ERA (t+1)`)
## [1] 0.3088845
<<<<<<< HEAD

Already, we see that some of the slightly more advanced analytics have stronger correlations with future ERA than the more rudimentary counting statistics. Additionally, we see that expected weighted on-base average (xwOBA) outperforms weighted on-base average (wOBA). Unlike wOBA, xwOBA attempts to mitigate some of the stochasticity in pitching outcomes by aggregating batted ball data to predict outcome. In some ways, we can think of xwOBA has a more true measure of pure pitching performance than wOBA, which helps explain why it has a stronger correlation with future ERA. This insight encouraged us to look into some more of the relationships between raw metrics (similar to wOBA) and expected metrics (similar to xwOBA), and the subsequent effect on ERA forecasting:

=======

Already, we see that the more advanced analytics have stronger correlations with future ERA than the more rudimentary counting statistics. Additionally, we see that expected weighted on-base average (xwOBA) outperforms weighted on-base average (wOBA). Unlike wOBA, xwOBA attempts to mitigate some of the stochasticity in pitching outcomes by aggregating batted ball data to predict results. In some ways, we can think of xwOBA has a more true measure of pure pitching performance than wOBA, which helps explain why it has a stronger correlation with future ERA. This also allows us to explore the difference between outcome-based statistics (like wOBA) and estimates of expected outcome (like xwOBA) as a measure of a pitcher's "luck", that is how much better or worse batters performed against the pitcher than expected (based on the quality of caontact allowed). A further extension of comparing the differences in outcome and expected outcome is to determine if those differences tend to correct themselves over time. To do so, we analyze the change in a pitcher's ERA from one season to the next as explained by the difference between different predictive and raw measures:

>>>>>>> 10ce5a14b5b3ac8398fed932a0f2c507c35fe9dc
pitcher_data %>%
  ggplot(aes(x = `BABIP - Mean BABIP`, y = `ΔERA`)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(pitcher_data$`BABIP - Mean BABIP`, pitcher_data$`ΔERA`)
## [1] -0.355732
pitcher_data %>%
  ggplot(aes(x = `xBA - BA`, y = `ΔERA`)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(pitcher_data$`xBA - BA`, pitcher_data$`ΔERA`)
## [1] 0.3409661
pitcher_data %>%
  ggplot(aes(x = `xwOBA - wOBA`, y = `ΔERA`)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(pitcher_data$`xwOBA - wOBA`, pitcher_data$`ΔERA`)
## [1] 0.3502287
<<<<<<< HEAD

Explain luck reversion. Better predictor than just ERA.

Difference between raw and expected metric in accordance with ERA should be a strong predictor of future ERA (due to the correlation with change in ERA).

Later: add corner %, meatball %, barrel %, ERA/barrel %. Add 2019 to data set

pitch_data  %>%
  ggplot(aes(x = `Average Pitch Speed`, y = BAA)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = lm, color = "grey50") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cor(pitch_data$`Average Pitch Speed`, pitch_data$BAA)
## [1] 0.08112214
pitch_data  %>%
  ggplot(aes(x = `Average Pitch Speed`, y = BAA, color =`Pitch Category`)) +
  geom_point(size = 2, alpha = 0.5) +
  geom_smooth(method = lm) +
  theme_minimal()
## `geom_smooth()` using formula 'y ~ x'

Conclusion: average pitch speed has no meaningful relationship with BAA, even when the data is almost perfectly seperated by categorical variable level. This was also true of average spin rate. Results are consistent with different response choices (BAA, wOBAA, SLGA, etc.). In other words, pitch tracking data has no predictive power when it comes to modelling outcome and/or pitcher performance. This explains why Statcast uses batted ball variables (such as launch angle and exit velocity) as the predictors in their models for expected outcomes.

=======

As we see in the data, there is clear statistical evidence of what we could call "mean-reversion of luck". Pitchers whose actual performances were worse than expected performances generally saw an increase in ERA the next season as there "luck" reverted closer to the mean. The opposite can be said for Pitchers who performed better than exoected. This conclusion suggests that including differences in outcome-based statistics and expected statistics should significant improve the predictive power of our model.

Conclusion

From our preliminary data exploration, it appears that discrepancies exist between the predictors that are associated with salary and the predictors associated with future performance. Specifically, it seems like raw, unadjusted counting stats associated with volume of pitches rather than quality are highly correlated with salary; while advanced statistics, especially statistics that calculate expected values, are better predictors of future performance. We also elucidated the phenomenon of a "mean-reversion of luck" which suggests that a pitcher's expected performance, derived from statistics such as xWOBA and xBA rather than just WOBA and BA, can help remove some of the noise associated with luck and stochastic fluctuations. Overall, it appears that the differences in strength of correlation between salary and future performance are significant, our next step will be to create and fine-tune models to predict both response variables to both futher delineate what predictors the models differ on as well as investigate both over- and under-paid or valued pitchers.

>>>>>>> 10ce5a14b5b3ac8398fed932a0f2c507c35fe9dc